EXPLORATORY DATA ANALYSIS OF RED WINE DATA by Ramya G

Description about dataset

The red wine dataset contains 1599 observations of 13 variables. Each observation is a different red wine, and the variables describe the chemical characteristics that may influence its quality.

Structure of dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

So, most of the variables are of numeric datatype; only X and quality are stored as integers.

The variable X is a serial number for each wine and plays no role in determining the quality of wines.

Since quality is the score given to each wine, it can be converted into a factor variable with one level per score. So let’s create a new variable, quality.f, which is a factor.

## [1] "3" "4" "5" "6" "7" "8"
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.f           : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

So, the quality.f feature now holds the wine scores as six levels, from 3 to 8.
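The conversion described above can be sketched as follows. This is only a minimal sketch: the data frame name `wine` is taken from the echoed calls later in this report, the ten scores are the ones shown in the str() output, and listing the levels explicitly with `levels = 3:8` is my assumption, since the original call is not shown.

```r
# First ten quality scores, copied from the str() output above.
wine <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Listing the levels explicitly (3 to 8) keeps all six scores even when a
# subset, like this one, does not contain every score.
wine$quality.f <- factor(wine$quality, levels = 3:8)

levels(wine$quality.f)      # the six score labels, "3" to "8"
as.integer(wine$quality.f)  # internal codes: a score of 5 is stored as 3
```

The internal codes explain why str() prints 3 3 3 4 ... for quality.f while quality itself reads 5 5 5 6 ...: the factor stores the position of each score among its levels.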

Let’s look at the most common scores among the wines by plotting their frequency distribution as a histogram.

Univariate Plots Section

Quality plot

The quality score ranges from 3 to 8, with most of the wines having average scores of 5 or 6.

The numbers of observations with average quality scores of 5 and 6 are 681 and 638 respectively.

## [1] 681
## [1] 638

The numbers of wines with quality scores of 8 and 7 are 18 and 199 respectively.

## [1] 18
## [1] 199

The number of poor quality wines with a rating of 3 in the dataset is 10.

## [1] 10
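The counts above can be turned into a frequency table and percentage shares. This is a sketch under one stated assumption: the count for score 4 is not reported in this section, so it is derived here as the remainder of the 1599 observations.

```r
# Counts per quality score. Scores 3, 5, 6, 7 and 8 are the counts reported
# above; the count for score 4 is not reported and is derived here as the
# remainder of the 1599 observations.
counts <- c(`3` = 10, `4` = 1599 - (10 + 681 + 638 + 199 + 18),
            `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild a quality vector with those counts; table() then gives the
# frequencies and prop.table() the shares.
quality <- rep(as.integer(names(counts)), times = counts)
freq <- table(quality)
share <- round(100 * prop.table(freq), 1)

freq
share

# Share of average wines (scores 5 and 6) and of the two extremes (3 and 8).
average_share <- 100 * sum(freq[c("5", "6")]) / sum(freq)
extreme_share <- 100 * sum(freq[c("3", "8")]) / sum(freq)
```

With the real data loaded, `table(wine$quality)` would give the same table directly.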

Now, let’s visualize the distribution of each variable by plotting histograms.

Creating function for frequency distribution.
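The function itself is not shown in this report, so the following is only a sketch of what such a helper might look like. The name `plot_hist`, its arguments and the styling are hypothetical; the ten values are the first fixed acidity readings from the str() output, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Hypothetical helper (the name and arguments are illustrative): draws a
# histogram of one numeric column of a data frame.
plot_hist <- function(data, variable, bins = 30) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(bins = bins, color = "black", fill = "steelblue") +
    labs(x = variable, y = "Number of wines")
}

# First ten fixed acidity values, copied from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

fa <- plot_hist(wine, "fixed.acidity")
fa_log <- fa + scale_x_log10()  # same histogram on a log10 x-axis
```

Keeping the plot in an object such as `fa` is what allows a scale or label to be added to it afterwards.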

Fixed acidity plot

From the graph we can see that fixed acidity is roughly bell-shaped but positively skewed, with a tail towards the right. This might be due to the presence of outliers.

Checking for outliers with boxplot stats

##  [1] 12.8 12.8 15.0 15.0 12.5 13.3 13.4 12.4 12.5 13.8 13.5 12.6 12.5 12.8
## [15] 12.8 14.0 13.7 13.7 12.7 12.5 12.8 12.6 15.6 12.5 13.0 12.5 13.3 12.4
## [29] 12.5 12.9 14.3 12.4 15.5 15.5 15.6 13.0 12.7 13.0 12.7 12.4 12.7 13.2
## [43] 13.2 13.2 15.9 13.3 12.9 12.6 12.6
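For reference, here is a small sketch of how `boxplot.stats()` flags such values, using only the first ten fixed acidity readings from the str() output rather than the full data.

```r
# First ten fixed acidity values, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# boxplot.stats() returns the five boxplot statistics in $stats and the
# points beyond the whiskers in $out.
bs <- boxplot.stats(fixed_acidity)

# The same rule written out: a point is flagged when it lies more than
# 1.5 times the box height beyond the lower or upper hinge.
lower_hinge <- bs$stats[2]
upper_hinge <- bs$stats[4]
box_height <- upper_hinge - lower_hinge
flagged <- fixed_acidity[fixed_acidity < lower_hinge - 1.5 * box_height |
                         fixed_acidity > upper_hinge + 1.5 * box_height]

bs$out   # 11.2 is the only value flagged in this sample
flagged  # identical to bs$out
```

Note that this is a rule of thumb for spotting unusual values, not proof that a measurement is wrong.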

Let’s try plotting a log-transformed histogram to see if it reduces the tail.

fa + scale_x_log10() + labs(x = 'log10 Fixed Acidity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The log-transformed plot of fixed acidity is now much closer to a normal distribution.

Volatile acidity plot

The plot is positively skewed. Let’s check for outliers.

##  [1] 1.130 1.020 1.070 1.330 1.330 1.040 1.090 1.040 1.240 1.185 1.020
## [12] 1.035 1.025 1.115 1.020 1.020 1.580 1.180 1.040

The log-transformed volatile acidity plot now shows negative skew. However, it is closer to symmetric than the original volatile acidity plot.

Citric acid plot

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
boxplot.stats(wine$citric.acid)$out
## [1] 1

As can be seen, there is only one outlier in this variable, so the unusual distribution of citric acid is not due to outliers. Rather, it is because more than 250 wines have citric acid values between 0 and 0.125.
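A count like the one quoted above can be obtained with a logical comparison. The sketch below uses only the first ten citric acid values from the str() output, so its numbers describe that small sample and not the full dataset.

```r
# First ten citric acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

low <- citric_acid <= 0.125  # TRUE for wines in the low range
sum(low)                     # number of low citric acid wines
mean(low)                    # the same, as a fraction of all wines
sum(citric_acid == 0)        # wines with no citric acid at all
```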

Residual sugar distribution

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## [1]  0.9 15.5

The residual sugar plot has a long right tail. The range is wide (0.9 to 15.5), while the median (2.2) and mean (2.5) are both close to 2 and 75% of the data falls below 3. This variable definitely has outliers, so let’s have a look at them.

##   [1]  6.10  6.10  3.80  3.90  4.40 10.70  5.50  5.90  5.90  3.80  5.10
##  [12]  4.65  4.65  5.50  5.50  5.50  5.50  7.30  7.20  3.80  5.60  4.00
##  [23]  4.00  4.00  4.00  7.00  4.00  4.00  6.40  5.60  5.60 11.00 11.00
##  [34]  4.50  4.80  5.80  5.80  3.80  4.40  6.20  4.20  7.90  7.90  3.70
##  [45]  4.50  6.70  6.60  3.70  5.20 15.50  4.10  8.30  6.55  6.55  4.60
##  [56]  6.10  4.30  5.80  5.15  6.30  4.20  4.20  4.60  4.20  4.60  4.30
##  [67]  4.30  7.90  4.60  5.10  5.60  5.60  6.00  8.60  7.50  4.40  4.25
##  [78]  6.00  3.90  4.20  4.00  4.00  4.00  6.60  6.00  6.00  3.80  9.00
##  [89]  4.60  8.80  8.80  5.00  3.80  4.10  5.90  4.10  6.20  8.90  4.00
## [100]  3.90  4.00  8.10  8.10  6.40  6.40  8.30  8.30  4.70  5.50  5.50
## [111]  4.30  5.50  3.70  6.20  5.60  7.80  4.60  5.80  4.10 12.90  4.30
## [122] 13.40  4.80  6.30  4.50  4.50  4.30  4.30  3.90  3.80  5.40  3.80
## [133]  6.10  3.90  5.10  5.10  3.90 15.40 15.40  4.80  5.20  5.20  3.75
## [144] 13.80 13.80  5.70  4.30  4.10  4.10  4.40  3.70  6.70 13.90  5.10
## [155]  7.80

So, the residual sugar variable has a large number (155) of outliers compared to the other explanatory variables.

Let us limit the x-axis to exclude the outliers and observe the distribution.

So, we can say that the skewness in the earlier plot was due to outliers; with them excluded, residual sugar looks approximately normally distributed.
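The exact limits used for the rescaled plot are not shown here. One possible choice, sketched below with base graphics and the first ten residual sugar values from the str() output, is to use the boxplot whisker ends as the axis limits; treat this as an assumption, not as the method used for the plot above.

```r
# First ten residual sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Whisker ends: the most extreme values that are not flagged as outliers.
whiskers <- boxplot.stats(residual_sugar)$stats[c(1, 5)]

# Histogram of all values, then of the values inside the whisker range only.
# The original vector is left unchanged.
hist(residual_sugar, main = "All values", xlab = "Residual sugar")
inside <- residual_sugar[residual_sugar >= whiskers[1] &
                         residual_sugar <= whiskers[2]]
hist(inside, main = "Within whiskers", xlab = "Residual sugar",
     xlim = whiskers)
```

Because the outliers are only left out of the picture, summary statistics computed on the full variable still include them.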

Chlorides plot

We can say that the majority of the data falls between 0 and 0.15.

Detecting outliers

##   [1] 0.176 0.170 0.368 0.341 0.172 0.332 0.464 0.401 0.467 0.122 0.178
##  [12] 0.146 0.236 0.610 0.360 0.270 0.039 0.337 0.263 0.611 0.358 0.343
##  [23] 0.186 0.213 0.214 0.121 0.122 0.122 0.128 0.120 0.159 0.124 0.122
##  [34] 0.122 0.174 0.121 0.127 0.413 0.152 0.152 0.125 0.122 0.200 0.171
##  [45] 0.226 0.226 0.250 0.148 0.122 0.124 0.124 0.143 0.222 0.039 0.157
##  [56] 0.422 0.034 0.387 0.415 0.157 0.157 0.243 0.241 0.190 0.132 0.126
##  [67] 0.038 0.165 0.145 0.147 0.012 0.012 0.039 0.194 0.132 0.161 0.120
##  [78] 0.120 0.123 0.123 0.414 0.216 0.171 0.178 0.369 0.166 0.166 0.136
##  [89] 0.132 0.132 0.123 0.123 0.123 0.403 0.137 0.414 0.166 0.168 0.415
## [100] 0.153 0.415 0.267 0.123 0.214 0.214 0.169 0.205 0.205 0.039 0.235
## [111] 0.230 0.038

Scaling the x-axis to exclude outliers

Now it looks approximately normally distributed.

Free Sulphur dioxide plot

Summary statistics and outliers of free sulfur dioxide variable

The range is large (about 70) and, moreover, 75% of the data has free sulfur dioxide content close to the mean value. So the plot is right skewed and the maximum value might be an outlier.

##  [1] 52 51 50 68 68 43 47 54 46 45 53 52 51 45 57 50 45 48 43 48 72 43 51
## [24] 51 52 55 55 48 48 66

Let’s create a log-transformed plot for free sulphur dioxide.

fs + scale_x_log10() + labs(x = 'log10 Free Sulphur dioxide')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The first histogram was asymmetric with a right tail. The log-transformed histogram now has a comparatively short left tail and looks bimodal, with peaks at values of about 5 and 10.

Total Sulphur dioxide plot

The blue line represents the mean value. From the shape of the plot, it looks like there are outliers. Let’s use boxplot stats to detect outliers in this variable.

##  [1] 145 148 136 125 140 136 133 153 134 141 129 128 129 128 143 144 127
## [18] 126 145 144 135 165 124 124 134 124 129 151 133 142 149 147 145 148
## [35] 155 151 152 125 127 139 143 144 130 278 289 135 160 141 141 133 147
## [52] 147 131 131 131

We can clearly see that there are outliers in this variable. Let’s examine them in more detail.

From the graph, we can say that most of the wines have total SO2 content below the mean value (blue line). By definition, 50% of the wines fall below the median, and in this right-skewed variable more than half also fall below the mean.

pH Plot

The pH histogram looks approximately normally distributed.

Density plot

As can be seen, pH and density both follow approximately normal distributions.

Observations:

  1. About 82% of the wines in the dataset are average quality wines with a rating of 5 or 6.
  2. The poorest (score 3) and best (score 8) quality wines together make up less than 2% of the data.
  3. Of all the variables examined, pH and density are the ones whose histograms look approximately normal.
  4. Almost all the variables had outliers, leading to skewed histograms. Variables like fixed acidity and volatile acidity showed improved distributions in the log-transformed plots. However, residual sugar and chlorides had a large number of outliers, and their axes were rescaled to reduce the tails.

Univariate Analysis

What is the structure of your dataset?

The original dataset has 1599 observations of 13 variables. All variables are numeric except the first (X, an ID number for each wine) and the last (quality), which are integers. Later, a new variable 'quality.f' was created, a factor with six levels. Hence, the final dataset has dimensions of 1599 observations and 14 variables.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data is the variable quality, which is also the response variable in the dataset.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think variables related to acidity, such as fixed acidity, citric acid content and volatile acidity, along with sugar content, are going to influence the quality of wine.

I am especially looking to learn about the relationship between citric acid and the quality of wine, as there are many wines with low citric acid content.

Did you create any new variables from existing variables in the dataset?

Yes. Quality is the rating given to the red wines, and since it takes only integer values between 3 and 8, it can be converted into a factor. So, a new variable ‘quality.f’ was created, a factor with six levels.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes. Looking at the distributions of the variables, I checked for outliers and rescaled the plots to exclude them. I did this in order to see to what extent outliers affect the distribution curves. For instance, for residual sugar and chlorides, excluding outliers by rescaling the plot resulted in approximately normal distributions, whereas for citric acid the unusual distribution was not due to outliers.

Bivariate Plots Section

Firstly, let us observe the relationships between the different variables in the dataset. For that purpose, let’s create a correlation matrix using corrplot. This matrix shows the direction of each correlation and the correlation coefficients. The coefficients are Pearson’s correlation coefficients.

Creating correlation matrix

Red and blue squares indicate negative and positive correlations respectively.

From this we can see that our main feature of interest, quality, has its strongest correlations with alcohol, volatile acidity, citric acid and sulphates.

NOTE: The values in the correlation matrix are rounded to one decimal place.
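The numbers behind such a matrix come from `cor()`. The sketch below builds a rounded Pearson matrix from the first ten rows of six columns shown in the str() output; with so few rows the coefficients will not match the ones reported for the full dataset, so it only illustrates the mechanics.

```r
# First ten rows of six columns, copied from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns, rounded to one
# decimal place as in the plotted matrix.
cor_matrix <- round(cor(wine, method = "pearson"), 1)
cor_matrix

# Correlations with quality, strongest first (by absolute value).
with_quality <- cor_matrix["quality", colnames(cor_matrix) != "quality"]
with_quality[order(abs(with_quality), decreasing = TRUE)]
```

Sorting by absolute value is what makes a negative correlation, such as the one with volatile acidity, rank alongside the positive ones.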

We can also observe that features other than quality have strong correlations among themselves. For instance, fixed acidity has a strong positive correlation of approximately 0.7 (0.67) with both density and citric acid, and a strong negative correlation of -0.68 with pH.

Now, let’s plot a scatterplot matrix to visualize the relationships between the different variables in the dataset.

SCATTERPLOT MATRIX FOR RED WINE DATASET

Now that we have an idea of the relations between the various features, let’s explore them further with bivariate plots.

To start with, let’s take a look at plots of quality against its most strongly related features.

Functions for the scatterplot and boxplot are created; now let’s use them to make plots.

Plots of quality vs alcohol and quality vs volatile acidity

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Quality has a moderate positive correlation with alcohol and a moderate negative correlation with volatile acidity.

From both the plots and the summary statistics, we can say that

  1. Mean alcohol concentration increases with quality, except for a dip at quality score 5.

  2. As volatile acidity decreases, the quality of wines is observed to improve.
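The grouped summaries above have the layout that `by()` prints. The sketch below shows that call on the first ten rows from the str() output, and then checks the first observation using the mean alcohol values reported in the summaries above.

```r
# First ten rows, copied from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# summary() of alcohol within each quality score; printing the result
# gives one block per score.
print(by(wine$alcohol, wine$quality, summary))

# Mean alcohol per quality score, as reported in the summaries above.
mean_alcohol <- c(`3` = 9.955, `4` = 10.27, `5` = 9.9,
                  `6` = 10.63, `7` = 11.47, `8` = 12.09)

# Change in mean alcohol from each score to the next; a negative value
# marks a step where the mean falls instead of rising.
steps <- diff(mean_alcohol)
steps
names(steps)[steps < 0]  # "5": the mean dips going from score 4 to 5
```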

Plots of quality vs citric acid and quality vs sulphates

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

From both the plots and the summary statistics, we can see that

  1. a weak positive correlation (0.22, rounded to 0.2 in the correlation matrix) exists between quality and citric acid. There is a slight increase in mean citric acid content in better quality wines.

This positive correlation is also evident from the median citric acid values, which increase with every quality score.

We saw earlier in the univariate section that many wines have citric acid concentrations below 0.125. Here, it is interesting to observe that the poor quality wines with scores of 3 and 4 tend to have the lowest citric acid levels, although low values also occur at scores of 5 and 6 (first quartile of 0.09 for both). Note that the median citric acid values for scores 3 and 4 are below 0.1 and their means are around 0.17, whereas the medians for scores of 7 and 8 are 0.4 and 0.42 respectively.

  2. as the mean sulphates concentration increases, the quality also improves. But this is also a weak positive correlation (0.25, rounded to 0.3 in the correlation matrix).

As we have seen that alcohol, volatile acidity, citric acid and sulphates are the features most strongly related to the quality of wines, let’s take a look at how these variables are related to one another. Let’s explore the variable pairs that show correlations of at least 0.3 in magnitude in the correlation matrix.

Let’s start by creating a function for scatter plots and use it to make the plots.

Scatter plots between the variables that have the strongest correlations with quality

From these plots, we can say that wines with higher citric acid and sulphate concentrations have lower volatile acidities, and that there is a positive correlation between sulphates and citric acid. Note that sulphates have weak to moderate correlations with both citric acid and volatile acidity.

As these features have an effect on wine quality, it will be interesting to look at these relations for every quality score. Let us explore them further under multivariate analysis.

Now, let’s visualise the relationships between other features that have inter-variable correlations of at least 0.5 in magnitude.

Plots of fixed acidity with citric acid, density, and pH

From the above plots, it is evident that fixed acidity has strong correlations with citric acid, density and pH. As citric acid content and density increase, fixed acidity also tends to increase, and higher fixed acidity in turn goes with lower pH.

Scatterplot for density with alcohol, and residual sugar; citric acid and pH; Free \(SO_2\) and Total \(SO_2\)

There is a moderate negative correlation between alcohol and density, and a moderate positive correlation between density and residual sugar. So, as the alcohol content of wine increases, its density falls, and as the residual sugar content increases, its density rises.

Free sulphur dioxide and total sulphur dioxide show a strong positive correlation. This is expected, as free sulphur dioxide is a component of total sulphur dioxide.
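The part-of-a-whole relationship can be checked directly. The sketch below does so on the first ten rows from the str() output only, so it confirms the pattern for that sample and not for the whole dataset.

```r
# First ten rows, copied from the str() output above.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# If free SO2 is part of total SO2, it can never exceed it.
never_exceeds <- all(wine$free.sulfur.dioxide <= wine$total.sulfur.dioxide)
never_exceeds

# Share of total SO2 that is free, per wine.
free_share <- wine$free.sulfur.dioxide / wine$total.sulfur.dioxide
round(range(free_share), 2)
```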

As expected, the pH of wines falls as citric acid content increases.

A moderate positive correlation exists between density and both residual sugar and citric acid. I initially thought residual sugar was going to influence the quality of wines. But from the correlation matrix, we can see that it has essentially no correlation with quality.

  • Observations from Bivariate Analysis:

    • Alcohol and volatile acidity show moderate positive and negative correlations with quality respectively, and sulphates and citric acid are weakly positively correlated with quality. All the remaining features show very weak to no correlation with our main feature of interest.

    • There are some strong inter-variable correlations in the red wine dataset. Fixed acidity is one such feature, with strong correlations with density, citric acid and pH. Free sulphur dioxide and total sulphur dioxide are also strongly correlated with each other.

    • Residual sugar has only one notable correlation, with density, and it is interesting to observe that it has no effect on the quality of wines.

    • There are also some interesting correlations between the variables that affect quality. Volatile acidity shows negative correlations with citric acid and sulphates, while sulphates are positively correlated with citric acid. That is, wines with higher citric acid and sulphate concentrations have lower volatile acidities.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Our main feature of interest, quality, has positive correlations with alcohol, citric acid and sulphates. So, higher concentrations of these in wines are associated with better quality. However, quality is moderately negatively correlated with volatile acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There were some strong inter-variable relationships between fixed acidity and density, citric acid, and pH. These were much stronger than their respective correlations with quality, the response variable.

I expected that there would be some relationship between residual sugar and the quality of wines but, surprisingly, there was none.

What was the strongest relationship you found?

In the entire dataset, the strongest positive correlation, 0.67 (~0.7), was observed between fixed acidity and citric acid, followed very closely by fixed acidity and density (0.668) and by total sulphur dioxide and free sulphur dioxide.

The strongest negative correlation was between fixed acidity and pH, at -0.68.

Multivariate Plots Section

In the multivariate analysis, I would like to explore the relationships between different variables with respect to quality.

First, let’s start the multivariate analysis by creating functions for scatter plots.

We created two functions to visualize scatter plots: one colors the points by quality.f and the other facets by the same categorical variable (quality.f).
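The two functions are not shown in this report. The sketch below shows what such a pair might look like; the names `scatter_by_color` and `scatter_by_facet`, their arguments and the styling are all hypothetical, the ten rows come from the str() output, and the ggplot2 package is assumed to be installed.

```r
library(ggplot2)

# Hypothetical helpers (names and styling are illustrative).
# scatter_by_color(): one panel, points colored by quality.f.
scatter_by_color <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]], color = quality.f)) +
    geom_point(alpha = 0.6) +
    labs(x = x, y = y, color = "Quality")
}

# scatter_by_facet(): one panel per quality score.
scatter_by_facet <- function(data, x, y) {
  ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point(alpha = 0.6) +
    facet_wrap(~ quality.f) +
    labs(x = x, y = y)
}

# First ten rows, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
wine$quality.f <- factor(wine$quality)

# Build both views of the same pair of variables; print() draws them.
p_color <- scatter_by_color(wine, "volatile.acidity", "alcohol")
p_facet <- scatter_by_facet(wine, "volatile.acidity", "alcohol")
```

The colored version makes it easy to compare scores within one panel, while the faceted version shows the trend inside each score separately.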

Plots between variables showing strong correlations with quality

From the scatter plot, we can see that almost 75% of the green dots are located in the left area of the plot, i.e., volatile acidity < 0.5 and sulphates >= 0.7.

This means that higher quality wines have higher levels of sulphates and lower volatile acidities than poor quality wines.

This is an interesting plot, where both alcohol and volatile acidity show an influence on the quality of wines. We can clearly see that higher quality wines with scores of 7 and 8 (green dots) have higher alcohol content (above 10%) and lower volatile acidities (<0.6).

Volatile acidity and alcohol are weakly negatively correlated with each other. Faceted plots show that there is hardly any correlation between volatile acidity and alcohol content for quality scores of 4, 5, 6 and 7, while scores of 3 and 8 show strong and moderate correlations respectively. Overall, they show a correlation of -0.2.
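Per-score correlations like these can be computed by splitting the data by quality. The sketch below is a hypothetical helper, not the code behind the faceted plots: the name `cor_by_quality` and the `min_n` cut-off are my assumptions, and the ten rows come from the str() output.

```r
# First ten rows, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation within each quality score. Groups with fewer than
# min_n wines return NA, because a correlation computed from one or two
# points carries no information (two points can only give -1, 1 or NA).
cor_by_quality <- function(data, x, y, min_n = 3) {
  groups <- split(data, data$quality)
  sapply(groups, function(g) {
    if (nrow(g) < min_n) NA_real_ else cor(g[[x]], g[[y]])
  })
}

n_by_quality <- table(wine$quality)
r_by_quality <- cor_by_quality(wine, "volatile.acidity", "alcohol")

n_by_quality  # group sizes: 7, 1 and 2 wines in this sample
r_by_quality  # only score 5 has enough wines for a correlation
```

Reading the group sizes next to the coefficients is a useful habit, since the scores of 3 and 8 have only 10 and 18 wines in the full dataset.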

This is a very good plot, where we can observe the inter-variable correlation between two of the variables most strongly correlated with quality.

From the faceted plot, we can observe that for every grade of quality, volatile acidity values show a decreasing trend with higher citric acid levels.

From the scatter plot, we can clearly see that the majority of green dots are concentrated in the lower right portion of the plot, implying that high citric acid content and low volatile acidity go with the best quality wines.

The majority of green dots in the scatter plot are concentrated towards the upper right side of the graph, indicating that high levels of citric acid and sulphates go with better quality wines.

Now, let us fix the variables alcohol and citric acid and look at their strongest correlations with respect to Quality.

NOTE: Here, we are not exploring volatile acidity and sulphates with their other correlated variables in terms of quality as their correlation coefficients fall below 0.5.

From the above graphs, we can clearly see the influence of alcohol and density on the quality of wines. The majority of the better quality wines with scores of 7 and 8 have higher alcohol content (>10% alcohol). However, density has little influence on quality: even the better wines are spread between densities of 0.99 and 1, with most of them having lower densities (< 0.997). This is consistent with the weak negative correlation of density with quality.

Also, from the faceted plots, we can see that alcohol and density are negatively correlated for all the quality scores, with R values stronger than -0.3, except for score 5, where the negative correlation is weak.

For every grade of quality, citric acid is at least moderately correlated with pH. In fact, for better quality wines, it shows a strong negative correlation, stronger than -0.7.

Better quality wines are mostly located on the right side of the graph towards higher citric acid values with pH values between 3 and 3.5.

From the plot, we can clearly see that most of the best quality wines (green dots) are towards the right side of the graph, whereas pink dots are located in the left portion of the graph.

Also, on close inspection, there are more green dots than pink ones in the upper right portion. This is consistent with the weak positive correlation between fixed acidity and quality.

Observations:

* Wines that contained higher amounts of alcohol, citric acid, and sulphates were observed to have better quality than others.

* Also, lower levels of volatile acidity and lower density values went with better quality wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There were some strong correlations between citric acid and pH for every quality of wine, from the poorest to the best. In fact, for the highest quality scores of 7 and 8, these correlations were negative and close to 0.8 in magnitude.

Were there any interesting or surprising interactions between features?

There were no particularly surprising revelations about the feature interactions in this section. But it was interesting to see the interactions between volatile acidity and citric acid and to visualize their influence on the quality of wines.

Final Plots and Summary

Plot One

Histogram of Quality

Description One

This is the histogram for our main feature of interest, Quality (quality.f), in the red wine dataset. From this we can clearly see peaks at the scores of 5 and 6, implying that the majority (about 82%) of the wines in the data are of average quality. The top and poor quality wines, with scores of 8 and 3 respectively, contribute only about 1% of the data each.

Plot Two

Effect of Alcohol on Quality score of wines

Description Two

These are the most important plots of this analysis, as they reveal the effect of alcohol, the feature most strongly correlated with quality. The impact of alcohol on the quality of wines can be easily seen in the above plots: as the alcohol content of red wine increases, the quality of the wines improves. The trend of the red dots shows the increase in mean alcohol content with every quality score (the trend is especially clear from scores 5 to 8).

Plot Three

Description Three

This is the plot where we can clearly observe the interactions between the two explanatory variables along with our response variable, Quality. From this plot, we can see that higher citric acid goes with lower volatile acidity levels in red wine, which in turn goes with higher quality wines.

Reflection

This is an interesting project where I got to explore not only the red wine dataset but also the impressive ggplot package in R.

The red wine dataset originally consisted of 1599 observations and 13 variables. I did univariate exploration to find out the distribution of each variable in the dataset. This is where I learned that most of the wines in the dataset were average quality wines with scores of 5 and 6.

The bivariate analysis was done to determine the interactions between pairs of variables. This was an interesting section, where I found some strong correlations of fixed acidity with density, citric acid, and pH. This was also the section where I found the variables that are most strongly correlated with our main feature, Quality. It was surprising to see that very few variables are correlated with Quality and that even the strongest of its correlations was only of moderate strength (Alcohol and Quality, 0.5). Moreover, I initially thought that residual sugar was going to have some influence on the quality of red wine. However, to my surprise, I found that sugar does not have any effect on the quality of the red wine.

Next, multivariate explorations were done to explore three variables together and understand the combined influence of variables on the quality of wines. I also visualized the interactions between two different variables for every quality of wine by faceting them by Quality. This section revealed to me that high citric acid levels go with lower levels of volatile acidity and better Quality of wines. Moreover, higher amounts of alcohol and sulphates go with better wine quality.

In the process of exploration, I felt that the dataset was indeed a small one, with some quality scores having very few observations. For instance, a score of 3 had only 10 observations (<1% of the data) and a score of 8 had only 18 observations (about 1.1% of the data) in the entire dataset. So, I felt that the data was insufficient to correctly determine the interactions between features. This was because, in the multivariate section, I found some unusual spikes in the correlation coefficients of some features for the scores of 3 and 8. For instance, in the case of volatile acidity and alcohol, there was hardly any correlation between these features for any quality score except 3 and 8, where there were surprisingly strong and moderate correlations. I believe this was due to the very small number of observations in these two cases. So, I feel more data needs to be collected in future; adding information on fruit type (grape), age of the wines and their organoleptic properties, like smell and taste, would also make the dataset more interesting to analyse.